CoheMark: A Novel Sentence-Level Watermark for Enhanced Text Quality

Zhang, Junyan, Liu, Shuliang, Liu, Aiwei, Gao, Yubo, Li, Jungang, Gu, Xiaojie, Hu, Xuming

arXiv.org Artificial Intelligence

Watermarking technology is a method used to trace the usage of content generated by large language models. However, many existing sentence-level watermarking techniques depend on arbitrary segmentation or generation processes to embed watermarks, which can limit the availability of appropriate sentences. This limitation, in turn, compromises the quality of the generated response. To address the challenge of balancing high text quality with robust watermark detection, we propose CoheMark, an advanced sentence-level watermarking technique that exploits the cohesive relationships between sentences for better logical fluency. The core methodology of CoheMark involves selecting sentences through trained fuzzy c-means clustering and applying specific next sentence selection criteria. Experimental evaluations demonstrate that CoheMark achieves strong watermark strength while exerting minimal impact on text quality.

In recent years, the rapid advancement of large language models (LLMs) has revolutionized natural language processing (OpenAI, 2023; Yang et al., 2024; Touvron et al., 2023). This technological leap, while marking a significant milestone in artificial intelligence, has also brought about unprecedented challenges (Xu et al., 2024; Chen et al., 2023a; Mazeika et al., 2024). A major concern is that large language models can be exploited to generate false information and automated spam (Mirsky et al., 2023). To address this growing concern, researchers have begun focusing on developing various technologies to monitor AI-generated text and its usage. One effective way to track the usage of generated text is through watermarking, which involves embedding imperceptible information into the text (Kirchenbauer et al., 2023a; Kuditipudi et al., 2023; Zhao et al., 2023; Giboulot & Furon, 2024). This makes it easier to detect and track the text for potential misuse.
Compared to token-level watermarking methods, sentence-level watermarking is advantageous for preserving the internal semantic fluency within individual sentences and provides greater robustness.
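
Since the abstract names fuzzy c-means clustering as the basis for sentence selection, a minimal sketch of the standard fuzzy c-means membership formula may help make the idea concrete. It is not CoheMark's implementation: the function name `fcm_memberships`, the fuzzifier default `m=2.0`, and the use of plain coordinate tuples as stand-ins for sentence embeddings are all assumptions made for illustration. How CoheMark trains its cluster centers and which next sentence selection criteria it applies are not described in this excerpt.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of one point in each cluster.

    Hypothetical helper for illustration only; `point` and `centers`
    stand in for sentence embeddings and trained cluster centers.
    """
    # Euclidean distance from the point to every cluster center
    dists = [math.dist(point, c) for c in centers]

    # A point lying exactly on one or more centers belongs fully to them
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(centers))]

    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)); memberships sum to 1
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
            for d_i in dists]

# Toy 2-D example with two assumed centers
centers = [(0.0, 0.0), (4.0, 0.0)]
u = fcm_memberships((1.0, 0.0), centers)
```

Unlike hard clustering, each point receives a graded membership in every cluster, so a sentence near a boundary is not forced into a single group. In the toy example the point sits closer to the first center and so receives the larger share of membership there.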